Analytical evaluation of term weighting schemes for text categorization
Authors
Abstract
An analytical evaluation of six widely used term weighting techniques for text categorization is presented. The analysis depends on expressing the term weights using term occurrence probabilities in positive and negative categories. The weighting behaviors of the schemes considered are first clarified by analyzing the relation between the occurrence probabilities of terms which receive equal weights. Then, the weights are expressed in terms of the ratio and difference of term occurrence probabilities, which reveals the similarities and differences among the schemes. Simulations show that the relative performance of different schemes can be explained by the ways they use the ratio and difference of term occurrence probabilities in generating the term weights.
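As an illustration of the quantities the abstract refers to, the following minimal Python sketch estimates a term's occurrence probability in the positive and negative categories from document counts and then forms their ratio and difference. The function names, the example counts, and the `eps` guard are assumptions made for this illustration; the sketch does not reproduce any of the six schemes evaluated in the paper.

```python
def occurrence_probabilities(pos_with_term, pos_total, neg_with_term, neg_total):
    """Estimate P(term | positive) and P(term | negative) from document counts."""
    if pos_total <= 0 or neg_total <= 0:
        raise ValueError("each category needs at least one document")
    return pos_with_term / pos_total, neg_with_term / neg_total


def ratio_and_difference(p_pos, p_neg, eps=1e-9):
    """Return (p_pos / p_neg, p_pos - p_neg).

    eps is a hypothetical guard against division by zero when the term
    never occurs in the negative category.
    """
    return p_pos / max(p_neg, eps), p_pos - p_neg


# Two hypothetical terms, counted over 100 positive and 200 negative documents.
# Term A occurs in 40 positive and 10 negative documents.
p_pos_a, p_neg_a = occurrence_probabilities(40, 100, 10, 200)
ratio_a, diff_a = ratio_and_difference(p_pos_a, p_neg_a)

# Term B occurs in 8 positive and 2 negative documents.
p_pos_b, p_neg_b = occurrence_probabilities(8, 100, 2, 200)
ratio_b, diff_b = ratio_and_difference(p_pos_b, p_neg_b)

print(f"term A: ratio={ratio_a:.2f}, difference={diff_a:.2f}")
print(f"term B: ratio={ratio_b:.2f}, difference={diff_b:.2f}")
```

The two hypothetical terms share the same probability ratio (8) but have different probability differences (0.35 and 0.07). A weight driven only by the ratio would treat them alike, while a weight that also uses the difference would favor term A.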
Similar resources
Inverse-Category-Frequency based Supervised Term Weighting Schemes for Text Categorization
Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval (IR) field. The intuition behind idf for text categorization seems less reasonable than IR. In this paper, we introduce inverse category frequency (icf) int...
A Novel Term Weighting Scheme Midf for Text Categorization
Text categorization is a task of automatically assigning documents to a set of predefined categories. Usually it involves a document representation method and term weighting scheme. This paper proposes a new term weighting scheme called Modified Inverse Document Frequency (MIDF) to improve the performance of text categorization. The document represented in MIDF is trained using the support vect...
Inverse Category Frequency based supervised term weighting scheme for text categorization
Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval (IR) field. The intuition behind idf for text categorization seems less reasonable than IR. In this paper, we introduce inverse category frequency (icf) int...
Probabilistic Supervised Term Weighting for Binary Text Categorization
In text categorization, the class-agnostic (unsupervised) tf×idf term weighting scheme has seen widespread usage. Recently proposed supervised term weighting methods including tf×rf and tf×δidf make use of term class distribution to improve the classification accuracy. However, they only account for the presence of terms in classes, ignoring the absence of key categorical terms, which may giv...
Empirical Evaluation of Centroid-based Models for Single-label Text Categorization
Centroid-based models have been used in Text Categorization because, despite their computational simplicity, they show a robust behavior and good performance. In this paper we experimentally evaluate several centroid-based models on single-label text categorization tasks. We also analyze document length normalization and two different term weighting schemes. We show that: (1) Document length nor...
Journal title:
- Pattern Recognition Letters
Volume 31, Issue
Pages -
Publication date: 2010